ABSTRACT

This paper introduces the technique of anchor modeling in
the applications of speaker detection and speaker indexing.
The anchor modeling algorithm is refined by pruning the
number of models needed. The system is applied to the
speaker detection problem where its performance is shown
to fall short of the state-of-the-art Gaussian Mixture Model
with Universal Background Model (GMM-UBM) system.
However, it is further shown that its computational efficiency
lends itself to speaker indexing for searching large audio
databases for desired speakers. Here, excessive computation
may prohibit the use of the GMM-UBM recognition
system. Finally, the paper presents a method for cascading
anchor model and GMM-UBM detectors for speaker indexing.
This approach benefits from the efficiency of anchor
modeling and high accuracy of GMM-UBM recognition.

1.  INTRODUCTION 

This paper describes a method of representing and characterizing
a target utterance with information gained from a
set of anchor models derived from a predetermined set of
speakers. Since the speakers of the target utterances are not
members of the model training set, the system is capable of
characterizing the target speaker with no prior knowledge
of that speaker. Previous research [1,2] suggests that the
target speaker will be projected into a talker space defined
by the anchor models. Since the models are created only
once in the training phase, it is unnecessary to train a model
for a new target speaker. Applications of the approach include
speaker recognition, speaker detection, and speaker
clustering for very large speaker populations where it is
undesirable or infeasible to train models for every member of
the target population. Another application of anchor modeling
discussed in this paper is speaker indexing; that is,
the use of speaker detection for the retrospective searching
of large speech archives. For large archives, current
state-of-the-art speaker recognition systems may be too
computationally expensive to search. The efficiency of
the anchor system lends itself to the application of large
speech archive retrieval. It is shown that although the
detection performance of the anchor model system falls short
of state-of-the-art Gaussian Mixture Model with Universal
Background Model (GMM-UBM) speaker detection systems
[3, 4], the efficiency of anchor modeling can be effectively
exploited by embedding it in a two-stage cascaded
system, where the role of the anchor system is to reduce
the data load of the more accurate but less computationally
efficient GMM-UBM.


2.  ANCHOR  MODELS 


The basic concept of anchor modeling is the representation
of a target speech utterance with information gained from a
set of models pre-trained from a defined set of talkers. In
theory, the models could consist of virtually any method of
speech representation. Previous work [1,2] used
speaker-dependent Hidden Markov Models (HMM) as the anchors.
This study uses the GMM-UBM as the representation model
for forming the anchors.

Segments of speech, $s$, are scored against a set of
pre-trained anchor models, $A_i$, $i = 1, \ldots, N$. Each of the $N$
anchor models yields a likelihood score and the collection of
scores is used to form the $N$-dimensional characterization
vector. The speech utterance is represented by this
characterization vector $V$, where

$$ V = \begin{bmatrix} p(s|A_1) \\ p(s|A_2) \\ \vdots \\ p(s|A_N) \end{bmatrix} \qquad (1) $$


The  characterization  vector  can  be  considered  a  projection 
of  the  target  utterance  into  a  speaker  space  defined  by  the 
anchor  models.  If  an  utterance  from  a  single  speaker  projects 
into  a  unique  portion  of  the  speaker  space,  then  the  speaker 
representation  is  unique.  Speaker  detection  is  performed  by 
considering  the  location  of  the  vectors  within  this  speaker 
space. 

Speech segments are compared by scoring a speech segment
$s_u$ from an unknown speaker and a speech segment $s_t$
from a target speaker against the same set of anchor models
(Figure 1), thereby forming two characterization vectors,
$V_u$ and $V_t$, to represent the unknown and target segments
of speech. A vector distance is then used to compare the
speech segments.

Preliminary experiments using Euclidean, absolute value
or “city block”, and Kullback-Leibler distance measures
showed that Euclidean distance performed best. Unit
normalizing the elements of characterization vectors in the
distance calculation did not change performance.

Figure 1: The anchor model system.
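The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the paper: the anchor "models" are hypothetical toy scoring functions (`make_toy_anchor`) standing in for the GMM-UBM anchors, and the segments are short lists of numbers rather than speech features. Only the structure follows the text: score each segment against the same anchor set (Equation 1), then compare the resulting vectors with a Euclidean distance.

```python
import math


def characterization_vector(segment, anchor_scorers):
    """Score one segment against every anchor model, as in Equation 1."""
    return [score(segment) for score in anchor_scorers]


def euclidean_distance(v_u, v_t):
    """Euclidean distance between two characterization vectors."""
    if len(v_u) != len(v_t):
        raise ValueError("vectors must come from the same anchor set")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_u, v_t)))


def make_toy_anchor(mean):
    """Hypothetical stand-in for an anchor GMM (assumption, not from the paper).

    Returns a scorer giving the negative mean squared deviation from `mean`.
    """
    def score(segment):
        return -sum((x - mean) ** 2 for x in segment) / len(segment)
    return score


# Three toy anchors; the paper uses 668 GMM-UBM anchors.
anchors = [make_toy_anchor(m) for m in (0.0, 1.0, 2.0)]

v_t = characterization_vector([0.9, 1.1, 1.0], anchors)      # target segment
v_same = characterization_vector([1.0, 1.2, 0.8], anchors)   # similar toy "talker"
v_other = characterization_vector([2.1, 1.9, 2.0], anchors)  # different toy "talker"

d_same = euclidean_distance(v_same, v_t)
d_other = euclidean_distance(v_other, v_t)
print(d_same < d_other)  # True for this toy data
```

A detection decision would then compare the distance with a threshold; sweeping that threshold is what traces out a detection error tradeoff curve.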

The GMM-UBM anchor models described in this paper
were trained using speech from 668 talkers in the NIST-1996
and NIST-1999 speech corpora.¹ The GMM-UBM
algorithm used was the same as that developed for the NIST-2000
speaker recognition workshop [5, 6] but without speaker
(T-NORM) and handset (H-NORM) normalizations.

2.1.  Anchor  Model  Pruning 


The full anchor model characterization vector is formed by
scoring an utterance against all 668 anchor models. Methods
of reducing the size of the Euclidean distance comparison
were investigated in an effort to increase performance
by using only those anchor models that provide good
characterizing information. Reducing the size of the distance
comparison reduces the dimensionality of the speaker space
and increases computational efficiency.

Model pruning strategies were motivated by the observation
that the vector distance between characterization vectors
derived from the same talker should be small while distances
between characterization vectors of different speakers
should be large. Characterization vectors of two utterances
from the same talker were compared and the resulting
element distances, $d_i$, were rank ordered by magnitude,
where

$$ d_i = \left[ (V_{t_i} - V_{u_i})^2 \right]_{i = 1:N} \qquad (2) $$

and $V_t$ and $V_u$ are two characterization vectors obtained
from two target speech utterances. A percentage of the models
with the lowest element distances was then chosen as
the anchor model set. In a similar manner, characterization
vectors of utterances from different talkers can be evaluated
with Equation 2, where $V_t$ and $V_u$ are now characterization
vectors from different talkers. With this approach, only
those models with the largest element distances are chosen
for the anchor model set. Using these two methods of pruning,
the size of the Euclidean distance comparison was reduced
by 60% while the equal error rate was improved.

3.  SPEAKER  DETECTION  WITH  ANCHOR  MODELS

Results presented in this section used speech data from the
NIST-2000 Speaker Recognition Workshop, sectioning the
corpus into test and training sets and performing the evaluation
using the protocols stipulated in [7]. Figure 2 presents
the Detection Error Tradeoff (DET) curves for the NIST-2000
single speaker detection task primary condition.² The
equal error rate for the anchor model system using the full
characterization vector ($N = 668$) was 24.2% while the
equal error rate of the anchor system with model pruning
was 21.4%. Pruning of the models provides a relative
performance increase of 11.7%.

Figure 2: DET curves for the GMM-UBM and anchor
model system using the primary condition of single speaker
detection NIST-2000 speech corpus.

The performance of the anchor system falls well short of
the 7.7% equal error rate of the GMM-UBM system. The
next section discusses one application of speaker detection
where the computational efficiency of the anchor modeling
approach is used to advantage.

¹ The data used in the NIST evaluation is a subset of the Switchboard
I-II data corpora.

² The NIST “primary condition” uses 2 minute training segments and
15-45 second test segments collected with an electret microphone.

4.  SPEAKER  INDEXING

Figure 3: Precision versus recall plot for the GMM-UBM
and anchor model, with $P_t = 9\%$.

Speaker indexing is defined as the application of speaker
detection to the retrospective search of large speech archives.
Two possible uses of speaker indexing are the clustering of
speech messages contained in a speech archive and the
retrieval of a list of messages from an archive in response to
an external query. This paper focuses on the list retrieval
task. Performance in speaker detection evaluations has
traditionally been reported using a (prior-independent) DET
curve that describes the underlying tradeoff between misses
and false alarms for a given detector and corpus. However,
performance in information retrieval applications such
as speaker indexing is better described using the notions of
precision and recall. Detection theory and information
retrieval measures are related as follows: Recall is the proportion
of relevant material retrieved from the archive and so is
equal to the detection probability. Precision is the proportion
of retrieved material that is relevant and is given by

$$ \text{Precision} = \frac{P_t (1 - P_m)}{P_t (1 - P_m) + (1 - P_t) P_{fa}} \qquad (3) $$

where $P_t$ is the target probability (richness) of the archive,
$P_m$ is the probability of a miss, and $P_{fa}$ is the probability
of false alarm. These relationships can then be used to derive
speaker indexing performance (in terms of precision vs.
recall) from a DET plot for any given target probability $P_t$.
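As a worked illustration of Equation 3, the following Python sketch converts DET operating points into precision and recall. The function names are illustrative choices, not from the paper. The two operating points evaluated at the end are the equal error rates reported in Section 3 (7.7% for the GMM-UBM and 21.4% for the pruned anchor system), combined with the archive richness of 9%; the resulting precision values are simple arithmetic on those numbers and are not results reported by the authors.

```python
def recall(p_miss):
    """Recall equals the detection probability, 1 - P_m."""
    return 1.0 - p_miss


def precision(p_target, p_miss, p_fa):
    """Precision from Equation 3 for archive richness p_target."""
    retrieved_targets = p_target * (1.0 - p_miss)
    retrieved_nontargets = (1.0 - p_target) * p_fa
    retrieved = retrieved_targets + retrieved_nontargets
    if retrieved == 0.0:
        raise ValueError("nothing is retrieved, so precision is undefined")
    return retrieved_targets / retrieved


def precision_recall_curve(det_points, p_target):
    """Map (P_fa, P_m) points sampled from a DET curve to (recall, precision)."""
    return [(recall(p_m), precision(p_target, p_m, p_fa))
            for p_fa, p_m in det_points]


P_T = 0.09  # archive richness used in the paper

# Equal error rate operating points, where P_fa == P_m.
gmm_eer = precision(P_T, 0.077, 0.077)      # about 0.54 at recall 0.923
anchor_eer = precision(P_T, 0.214, 0.214)   # about 0.27 at recall 0.786
print(round(gmm_eer, 2), round(anchor_eer, 2))
```

Because the non-target term is weighted by $(1 - P_t)$, the same false alarm probability costs far more precision in a sparse archive than in a rich one.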

4.1.  Evaluation  of  the  GMM  and  Anchor  models  for 
Speaker  Indexing 

Figure 3 shows the precision versus recall tradeoff for the
GMM-UBM and anchor model speaker detectors using the
DET plots of Figure 2 (NIST-2000 speech corpus) and an
archive richness $P_t = 9\%$ (the richness of the NIST-2000
corpus). As expected, the GMM-UBM method outperforms
the anchor model. It is worth noting that the curves tend to
move toward the upper right with increasing $P_t$ and toward
the lower left with decreasing $P_t$.

Another measure of a speaker detector’s value for speaker
indexing applications is its computational efficiency. Here
it is assumed that each item in the archive is represented by
a model (trained off-line) against which a query is scored.
For the GMM-UBM, each 10 ms frame of the query is first
scored against the 2048-component universal background
model and then against 5 components of each of the archive
models [5]. For anchor model based speaker indexing, the
query is first converted to a characterization vector by scoring
it against the 668 anchor GMMs. The resulting characterization
vector is then compared to each archive characterization
vector (trained off-line) using a 668-element Euclidean
distance. Figure 4 plots the number of 38-dimensional
Gaussian computations (or equivalent) required for a 1 minute
query. (It is assumed that the computation times for one
38-element Gaussian and 38 Euclidean distances are equal.)
The plot for the anchor model system stays flat to about
$10^6$ because the computation is dominated by the conversion
of the query to a characterization vector. Note that this
is true for the pruned anchor system as well. It is apparent
that the anchor model speaker indexing system has significant
computational advantages for archives containing more
than about 1000 items. It should be noted that methods exist
for speeding up the computation required for the GMM that
would improve the efficiency of both the GMM-UBM and
anchor model systems.

Figure 4: Plot of computational efficiency for the GMM-UBM
and anchor model speaker detectors.

Figure 5: Cascaded speaker detection system.
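The operation counts described above can be approximated with a short Python sketch. It is an estimate under stated assumptions, not the computation behind Figure 4. The constants (10 ms frames, 2048 UBM components, 5 components per model, 668 anchors, 38 dimensions) come from the text. Two points are assumptions: that each anchor GMM is scored in the same way as an archive GMM (UBM first, then 5 components per model), and that a 668-element Euclidean distance costs 668/38 Gaussian equivalents.

```python
FRAMES_PER_SECOND = 100   # 10 ms frames
UBM_COMPONENTS = 2048     # universal background model size
TOP_COMPONENTS = 5        # components scored per speaker model
NUM_ANCHORS = 668         # full anchor set
FEATURE_DIM = 38          # one Gaussian ~ 38 element-distance terms (text's assumption)


def gmm_ubm_cost(archive_size, query_seconds=60):
    """Gaussian computations to score a query against every archive model."""
    frames = FRAMES_PER_SECOND * query_seconds
    return frames * (UBM_COMPONENTS + TOP_COMPONENTS * archive_size)


def anchor_cost(archive_size, query_seconds=60, num_anchors=NUM_ANCHORS):
    """Gaussian-equivalent computations for anchor model indexing.

    Assumes the anchors are scored like archive GMMs (not stated in the text).
    """
    frames = FRAMES_PER_SECOND * query_seconds
    conversion = frames * (UBM_COMPONENTS + TOP_COMPONENTS * num_anchors)
    comparison = archive_size * num_anchors / FEATURE_DIM
    return conversion + comparison


for size in (100, 1000, 10**4, 10**6):
    print(size, gmm_ubm_cost(size), round(anchor_cost(size)))
```

Under these assumptions the fixed conversion cost is about 3.2e7 operations and each archive item adds fewer than 18, so the anchor total does not double until the archive approaches two million items. That is consistent with the flat region described above, while the GMM-UBM cost grows by 30,000 operations per archive item.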

4.2.  Cascading 

Figures 3 and 4 show the tradeoff of computational efficiency
versus accuracy for speaker indexing. The GMM-UBM
has superior detection performance while the anchor
system provides the computational efficiency that is essential
when searching large archives. In an effort to gain a better
tradeoff between computational performance and accuracy,
the anchor and GMM-UBM speaker detection systems
were combined in a cascade as shown in Figure 5. The
objective of cascading is to construct a system containing the
positive aspects of both algorithms. The anchor model is
employed in the first stage to reduce the amount of
computational loading for the GMM-UBM speaker detection
system. The GMM-UBM is then used to provide maximum
recognition performance.

To evaluate the performance of the cascade, it is first
necessary to identify the operating point of the anchor system.
Define $q$ to be the fraction of the archive processed by
the second system of the cascade (i.e., the probability that
the first system declares a target). Note that $q$ is the
denominator of Equation 3:

$$ q = P_t (1 - P_m) + (1 - P_t) P_{fa} \qquad (4) $$

where $(1 - P_m)$ is the probability of detection and $P_{fa}$ is the
probability of false alarm for the anchor model speaker
detector. Given that the richness of the archive ($P_t$) is defined
by the application, choosing a unique value for $q$ identifies
a $(P_{fa}, P_m)$ pair from the DET curve (Figure 2) and
represents the chosen operating point for the anchor system.
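One way to carry out this selection numerically is sketched below in Python. It assumes the DET curve is available as a list of sampled $(P_{fa}, P_m)$ points and picks the sample whose pass fraction from Equation 4 is closest to the requested $q$. The function names and the sample points are hypothetical; the points are made up for illustration and are not read from Figure 2.

```python
def fraction_passed(p_target, p_miss, p_fa):
    """Equation 4: fraction of the archive the first stage passes on."""
    return p_target * (1.0 - p_miss) + (1.0 - p_target) * p_fa


def operating_point_for_q(det_points, p_target, q):
    """Return the sampled (P_fa, P_m) point whose pass fraction is nearest q."""
    if not det_points:
        raise ValueError("det_points must not be empty")
    return min(
        det_points,
        key=lambda point: abs(fraction_passed(p_target, point[1], point[0]) - q),
    )


# Hypothetical DET samples as (P_fa, P_m); NOT values from the paper.
toy_det = [(0.01, 0.80), (0.05, 0.50), (0.10, 0.35), (0.20, 0.20), (0.40, 0.10)]

p_fa, p_m = operating_point_for_q(toy_det, p_target=0.09, q=0.10)
print(p_fa, p_m, fraction_passed(0.09, p_m, p_fa))
```

With a densely sampled DET curve, interpolating between neighboring samples would give a pass fraction closer to the requested $q$ than this nearest-sample choice.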

The precision versus recall curve for the cascaded system
can be calculated in the same manner as in Section 4.1.
Figure 6 presents precision versus recall for the cascaded
system with $q = 10\%$ and an archive richness of $P_t = 9\%$.
The effect of the cascade is to slightly reduce the performance
in operating regions of low recall and to drastically
reduce performance in regions of mid-to-high recall, relative
to the GMM system.

Figure 6: Precision versus recall plot for the GMM-UBM,
anchor model, and cascaded system with $q = 10\%$ and
$P_t = 9\%$.

Figure 7: Estimated number of Gaussian (or equivalent)
computations, 1 minute query.

Figure 7 displays a plot of the estimated computational
efficiency for the GMM-UBM, anchor model, and cascaded
speaker indexing systems. As the amount of reduction in
archive size increases (smaller $q$), the computational efficiency
of the cascaded system also increases.
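The dependence on $q$ can be illustrated with a rough Python cost sketch. The cost model is an assumption rather than the one used for Figure 7: the first stage is costed as anchor indexing over the whole archive, and the second stage as GMM-UBM scoring over the fraction $q$ of the archive that survives, with the UBM scored again in the second stage. The constants repeat those given in Section 4.1.

```python
FRAMES = 100 * 60         # 10 ms frames in a 1 minute query
UBM_COMPONENTS = 2048
TOP_COMPONENTS = 5
NUM_ANCHORS = 668
FEATURE_DIM = 38


def gmm_ubm_cost(archive_size):
    """Gaussian computations for GMM-UBM scoring of the whole archive."""
    return FRAMES * (UBM_COMPONENTS + TOP_COMPONENTS * archive_size)


def cascade_cost(archive_size, q):
    """Assumed cascade cost: anchor stage on everything, GMM-UBM on q of it."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    first_stage = (FRAMES * (UBM_COMPONENTS + TOP_COMPONENTS * NUM_ANCHORS)
                   + archive_size * NUM_ANCHORS / FEATURE_DIM)
    second_stage = gmm_ubm_cost(q * archive_size)
    return first_stage + second_stage


archive = 10**6
for q in (0.5, 0.1, 0.01):
    ratio = cascade_cost(archive, q) / gmm_ubm_cost(archive)
    print(q, round(ratio, 3))
```

In this model the second stage dominates for large archives, so the cascade cost relative to GMM-UBM alone approaches $q$ until $q$ is small enough for the fixed anchor conversion cost to matter.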

5.  SUMMARY 

This paper presented a method of characterizing a segment
of a talker’s speech with information gained from a set of
pre-trained anchor models. The anchor models were derived
from a set of predetermined speakers. Characterization
vectors were then formed by scoring the target speech
segment against the set of anchor models. A method for
refining the anchor modeling system by pruning the anchor
models was presented and shown to increase recognition
performance.


Anchor modeling was then applied to the speaker detection
problem. Detection error tradeoff performance showed
that the anchor modeling system fell short of a
state-of-the-art GMM-UBM system. It was further shown that its
computational efficiency was superior to that of the GMM-UBM.
Comparison of the anchor model and GMM-UBM
systems for speaker indexing showed a similar tradeoff between
precision versus recall performance and computational
efficiency.

A cascaded speaker indexing system was proposed that
utilized the anchor model system as the first stage and the
GMM-UBM as the second stage. In this configuration, the
anchor model reduced the data loading on the GMM-UBM
while slightly reducing performance in operating regions of
low recall. The effect of the cascaded system was to combine
the advantages of both systems at the expense of some
loss in both computational performance and detection accuracy.
For large archives, the recognition performance of the
anchor system and the lack of computational efficiency of
the GMM-UBM system could preclude their application to
speaker indexing. The cascaded system may offer a viable
solution to the speaker indexing application.